In this project, a dataset containing the health status, as well as many other related factors, for all countries will be analyzed.
Firstly, an initial exploratory analysis of the data will be conducted to identify potential missing values and outliers, and corresponding decisions will be made to address them.
Secondly, a Principal Component Analysis (PCA) will be performed. This involves condensing the information provided by the original variables into a few linear combinations, aiming to achieve dimensionality reduction. These combinations are chosen to have maximum variance and to be orthogonal to one another, capturing the directions in which the observations vary the most while remaining mutually uncorrelated.
Thirdly, a Factor Analysis (FA) will be carried out, where latent variables (unobservable) with a high correlation to a group of observable variables and virtually no correlation with the rest will be identified. This results in further dimensionality reduction.
Finally, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) will be conducted, after verifying the necessary normality assumptions. Discriminant Analysis is a classification method that assigns new observations, based on their characteristics (explanatory or predictor variables), to the categories of a qualitative response variable.
Loading/installation of the R packages necessary for this project.
#install.packages("readr")
#Package required to call the 'read_csv' function.
library(readr)
#install.packages("pastecs")
#Package required to call the 'stat.desc' function.
library(pastecs)
#install.packages("summarytools")
#Package required to call the 'freq' function.
library(summarytools)
#install.packages("MASS")
#Package required to call the 'lda' function.
library(MASS)
#install.packages("psych")
#Package required to call the 'cortest.bartlett' function.
library(psych)
#install.packages("polycor")
#Package required to call the 'hetcor' function.
library(polycor)
#install.packages("ggcorrplot")
#Package required to call the 'ggcorrplot' function.
library(ggcorrplot)
#install.packages("corrplot")
#Package required to call the 'corrplot' function.
library(corrplot)
#install.packages("ggplot2")
#Package required to call the 'ggplot' function.
library(ggplot2)
#The 'stats' package ships with base R, so no installation is needed.
#Package required to call the 'factanal' function.
library(stats)
#install.packages("ggpubr")
#Package required to call 'ggarrange' function.
library(ggpubr)
#install.packages("reshape2")
# Package required to call 'melt' function.
library(reshape2)
#install.packages("biotools")
#Package required to call 'boxM' function.
library(biotools)
#install.packages("energy")
#Package required to call 'mvnorm.etest' function.
library(energy)
#install.packages("MVN")
#Package required to call 'mvn' function.
library(MVN)
#install.packages("klaR")
#Package required to call 'partimat' function.
library(klaR)
The dataset is taken from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO), which keeps track of the health status as well as many other related factors for all countries. The data cover the years 2000-2015 for 193 countries. The variables it contains are listed below.
The aim is to identify how such factors affect life expectancy; these factors include demographic variables, income composition, mortality rates, immunization, the human development index, and social and economic factors.
To load our data set we will use the function read_csv.
#Set a CRAN mirror.
options(repos = c(CRAN = "https://cloud.r-project.org"))
#Loading the data set into 'expect'.
expect<-read_csv("LED.csv")
#Renaming the columns ('nuevos_nombres' is Spanish for 'new names').
nuevos_nombres <- c("Country", "Year", "Status", "Life expectancy", "Adult Mortality", "infant deaths", "Alcohol", "percentage expenditure", "Hepatitis B", "Measles", "BMI", "under-five deaths", "Polio", "Total expenditure", "Diphtheria", "HIV/AIDS", "GDP", "Population", "thinness 1-19 years", "thinness 5-9 years", "Resource Income", "Schooling")
names(expect) <- nuevos_nombres
class(expect) #it is a tibble, which inherits from data.frame.
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
colnames(expect) #to view the different variables previously mentioned.
## [1] "Country" "Year" "Status"
## [4] "Life expectancy" "Adult Mortality" "infant deaths"
## [7] "Alcohol" "percentage expenditure" "Hepatitis B"
## [10] "Measles" "BMI" "under-five deaths"
## [13] "Polio" "Total expenditure" "Diphtheria"
## [16] "HIV/AIDS" "GDP" "Population"
## [19] "thinness 1-19 years" "thinness 5-9 years" "Resource Income"
## [22] "Schooling"
Now we identify the NA values and create a vector with the variables that have more than 5% of NAs. To do this, we create a function called percentageNA and apply it to the data frame with apply().
#We create a function to know which variables have more than 5% missing values.
percentageNA<-function(data){
  pct<-sum(is.na(data))/length(data)*100
  return(pct)
}
#Next, we apply the previous function to our dataframe.
contNA<-apply(expect,2,percentageNA)
contNA
## Country Year Status
## 0.0000000 0.0000000 0.0000000
## Life expectancy Adult Mortality infant deaths
## 0.3403676 0.3403676 0.0000000
## Alcohol percentage expenditure Hepatitis B
## 6.6031314 0.0000000 18.8223281
## Measles BMI under-five deaths
## 0.0000000 1.1572498 0.0000000
## Polio Total expenditure Diphtheria
## 0.6466984 7.6923077 0.6466984
## HIV/AIDS GDP Population
## 0.0000000 15.2484683 22.1919673
## thinness 1-19 years thinness 5-9 years Resource Income
## 1.1572498 1.1572498 5.6841389
## Schooling
## 5.5479918
#Now, we group the variables with more than 5% missing values.
missing_percentage <- colMeans(is.na(expect))
variables_with_missing <- missing_percentage[missing_percentage > 0.05]
variables_with_missing
## Alcohol Hepatitis B Total expenditure GDP
## 0.06603131 0.18822328 0.07692308 0.15248468
## Population Resource Income Schooling
## 0.22191967 0.05684139 0.05547992
As we can see, the variables with more than 5% of missing values are Alcohol, Hepatitis B, Total expenditure, GDP, Population, Resource Income and Schooling.
Then, we see which of them are numeric.
is_numeric_vector <- sapply(expect[names(variables_with_missing)], is.numeric)
is_numeric_vector
## Alcohol Hepatitis B Total expenditure GDP
## TRUE TRUE TRUE TRUE
## Population Resource Income Schooling
## TRUE TRUE TRUE
All of the variables with missing values are numeric. For the variables that have more than 5% missing values, we will analyze whether they follow a random pattern. To do this, we'll study homogeneity across groups (NA and non-NA) with respect to other variables. To assess whether the means of a variable in the NA group and the non-NA group are significantly different, we'll perform a statistical test; Student's t-test is commonly used for this purpose, and it is the one we will use, as our variables are continuous.
Firstly, we'll define new indicator variables that take the value 0 if the corresponding datum is not missing and 1 if it is missing.
Subsequently, a test of equality of means is conducted between the groups of the variable Measles (which has no missing values) defined by the values 0 and 1 of these new variables.
#Defining a function to create the new variables.
indicadoraNA<-function(data){
  data[!is.na(data)]<-0
  data[is.na(data)]<-1
  return(data)
}
#Preparing to perform the test, we use the previous function.
alcoholNA=indicadoraNA(expect$Alcohol)
hepBNA=indicadoraNA(expect$`Hepatitis B`)
totalexpNA=indicadoraNA(expect$`Total expenditure`)
gdpNA=indicadoraNA(expect$GDP)
populationNA=indicadoraNA(expect$Population)
incomeNA=indicadoraNA(expect$`Resource Income`)
schoolingNA=indicadoraNA(expect$Schooling)
#Student's t test.
t.test(expect$Measles~alcoholNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by alcoholNA
## t = 1.2405, df = 2936, p-value = 0.2149
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -613.5377 2726.9543
## sample estimates:
## mean in group 0 mean in group 1
## 2489.368 1432.660
t.test(expect$Measles~hepBNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by hepBNA
## t = -4.5361, df = 2936, p-value = 5.96e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -3504.629 -1389.222
## sample estimates:
## mean in group 0 mean in group 1
## 1959.024 4405.949
t.test(expect$Measles~totalexpNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by totalexpNA
## t = 1.1044, df = 2936, p-value = 0.2695
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -679.885 2433.465
## sample estimates:
## mean in group 0 mean in group 1
## 2487.038 1610.248
t.test(expect$Measles~gdpNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by gdpNA
## t = -0.70519, df = 2936, p-value = 0.4807
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1569.0568 738.9744
## sample estimates:
## mean in group 0 mean in group 1
## 2356.305 2771.346
t.test(expect$Measles~populationNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by populationNA
## t = 1.1423, df = 2936, p-value = 0.2534
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -416.6876 1579.7638
## sample estimates:
## mean in group 0 mean in group 1
## 2548.647 1967.109
t.test(expect$Measles~incomeNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by incomeNA
## t = -4.3771, df = 2936, p-value = 1.244e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5773.207 -2201.069
## sample estimates:
## mean in group 0 mean in group 1
## 2192.958 6180.096
t.test(expect$Measles~schoolingNA, var.equal=TRUE)
##
## Two Sample t-test
##
## data: expect$Measles by schoolingNA
## t = -4.4964, df = 2936, p-value = 7.18e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5948.182 -2335.733
## sample estimates:
## mean in group 0 mean in group 1
## 2189.797 6331.755
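The seven calls above can be condensed into a single loop; the following sketch assumes the data frame expect and the function indicadoraNA defined earlier, and collects only the p-values:

```r
#Sketch: collect the p-values of the seven t-tests in one pass.
vars_na <- c("Alcohol", "Hepatitis B", "Total expenditure", "GDP",
             "Population", "Resource Income", "Schooling")
pvals <- sapply(vars_na, function(v) {
  ind <- indicadoraNA(expect[[v]]) #0 = observed, 1 = missing.
  t.test(expect$Measles ~ ind, var.equal = TRUE)$p.value
})
round(pvals, 5)
```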
Now we examine the p-value in each case.
For the variables Alcohol, Total expenditure, GDP and Population, p > 0.15, so there is no evidence to reject the null hypothesis of equal means. The hypothesis of homogeneity is therefore accepted, and it is concluded that the missingness pattern is random. When the pattern is random, we choose to replace the NAs with the mean.
For the remaining variables under study, Hepatitis B, Resource Income and Schooling, p < 0.15, so homogeneity cannot be assumed. Strictly, this should be discussed with the researcher who posed the problem, because these values should neither be deleted nor naively replaced; since that is not feasible here, we decide to act as in the random-pattern case.
Thus, since the affected variables are all quantitative, the missing values are replaced with the mean of the non-missing values.
# Define a function to replace NAs with the mean of each column.
replace_na_with_mean <- function(column) {
  if (!is.numeric(column)) return(column) #leave qualitative columns untouched.
  column[is.na(column)] <- mean(column, na.rm = TRUE)
  return(column)
}
# Apply the replacement function to each column of 'expect'.
expect_imputed <- as.data.frame(lapply(expect, replace_na_with_mean))
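As a quick sanity check (a sketch, not part of the original output), we can confirm that no missing values remain after the imputation:

```r
#Every column of 'expect_imputed' should now report zero NAs.
colSums(is.na(expect_imputed))
sum(is.na(expect_imputed)) #should be 0.
```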
Now we perform a classical numeric descriptive analysis (measures of central tendency, dispersion, quantiles, symmetry, kurtosis, etc.). The only qualitative variables are Country and Status, so we will analyze those separately later.
First, we'll handle the quantitative ones. The stat.desc function from the pastecs package provides the most important measures of position, dispersion, and shape:
#We focus on the quantitative variables.
#We create a dataframe 'expect_num' where the NA have been substituted by the mean and the qualitative variables have been removed.
expect_num <- subset(expect_imputed, select = -c(Country, Status))
#Now we study the statistics with the function 'stat.desc'.
pastecs_stats <- stat.desc(expect_num)
pastecs_stats
As this function doesn't provide the quantiles, we'll use the function summary:
summary(expect_num)
## Year Life.expectancy Adult.Mortality infant.deaths
## Min. :2000 Min. :36.30 Min. : 1.0 Min. : 0.0
## 1st Qu.:2004 1st Qu.:63.20 1st Qu.: 74.0 1st Qu.: 0.0
## Median :2008 Median :72.00 Median :144.0 Median : 3.0
## Mean :2008 Mean :69.22 Mean :164.8 Mean : 30.3
## 3rd Qu.:2012 3rd Qu.:75.60 3rd Qu.:227.0 3rd Qu.: 22.0
## Max. :2015 Max. :89.00 Max. :723.0 Max. :1800.0
## Alcohol percentage.expenditure Hepatitis.B Measles
## Min. : 0.010 Min. : 0.000 Min. : 1.00 Min. : 0.0
## 1st Qu.: 1.093 1st Qu.: 4.685 1st Qu.:80.94 1st Qu.: 0.0
## Median : 4.160 Median : 64.913 Median :87.00 Median : 17.0
## Mean : 4.603 Mean : 738.251 Mean :80.94 Mean : 2419.6
## 3rd Qu.: 7.390 3rd Qu.: 441.534 3rd Qu.:96.00 3rd Qu.: 360.2
## Max. :17.870 Max. :19479.912 Max. :99.00 Max. :212183.0
## BMI under.five.deaths Polio Total.expenditure
## Min. : 1.00 Min. : 0.00 Min. : 3.00 Min. : 0.370
## 1st Qu.:19.40 1st Qu.: 0.00 1st Qu.:78.00 1st Qu.: 4.370
## Median :43.00 Median : 4.00 Median :93.00 Median : 5.938
## Mean :38.32 Mean : 42.04 Mean :82.55 Mean : 5.938
## 3rd Qu.:56.10 3rd Qu.: 28.00 3rd Qu.:97.00 3rd Qu.: 7.330
## Max. :87.30 Max. :2500.00 Max. :99.00 Max. :17.600
## Diphtheria HIV.AIDS GDP Population
## Min. : 2.00 Min. : 0.100 Min. : 1.68 Min. :3.400e+01
## 1st Qu.:78.00 1st Qu.: 0.100 1st Qu.: 580.49 1st Qu.:4.189e+05
## Median :93.00 Median : 0.100 Median : 3116.56 Median :3.676e+06
## Mean :82.32 Mean : 1.742 Mean : 7483.16 Mean :1.275e+07
## 3rd Qu.:97.00 3rd Qu.: 0.800 3rd Qu.: 7483.16 3rd Qu.:1.275e+07
## Max. :99.00 Max. :50.600 Max. :119172.74 Max. :1.294e+09
## thinness..1.19.years thinness.5.9.years Resource.Income Schooling
## Min. : 0.10 Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.60 1st Qu.: 1.60 1st Qu.:0.5042 1st Qu.:10.30
## Median : 3.40 Median : 3.40 Median :0.6620 Median :12.10
## Mean : 4.84 Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.10 3rd Qu.: 7.20 3rd Qu.:0.7720 3rd Qu.:14.10
## Max. :27.70 Max. :28.60 Max. :0.9480 Max. :20.70
Now we turn to the qualitative variables:
freq(expect_imputed$Country)
## Frequencies
## expect_imputed$Country
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ---------------------------------------------------------- ------ --------- -------------- --------- --------------
## Afghanistan 16 0.545 0.545 0.545 0.545
## Albania 16 0.545 1.089 0.545 1.089
## Algeria 16 0.545 1.634 0.545 1.634
## Angola 16 0.545 2.178 0.545 2.178
## Antigua and Barbuda 16 0.545 2.723 0.545 2.723
## Argentina 16 0.545 3.268 0.545 3.268
## Armenia 16 0.545 3.812 0.545 3.812
## Australia 16 0.545 4.357 0.545 4.357
## Austria 16 0.545 4.901 0.545 4.901
## Azerbaijan 16 0.545 5.446 0.545 5.446
## Bahamas 16 0.545 5.990 0.545 5.990
## Bahrain 16 0.545 6.535 0.545 6.535
## Bangladesh 16 0.545 7.080 0.545 7.080
## Barbados 16 0.545 7.624 0.545 7.624
## Belarus 16 0.545 8.169 0.545 8.169
## Belgium 16 0.545 8.713 0.545 8.713
## Belize 16 0.545 9.258 0.545 9.258
## Benin 16 0.545 9.803 0.545 9.803
## Bhutan 16 0.545 10.347 0.545 10.347
## Bolivia (Plurinational State of) 16 0.545 10.892 0.545 10.892
## Bosnia and Herzegovina 16 0.545 11.436 0.545 11.436
## Botswana 16 0.545 11.981 0.545 11.981
## Brazil 16 0.545 12.526 0.545 12.526
## Brunei Darussalam 16 0.545 13.070 0.545 13.070
## Bulgaria 16 0.545 13.615 0.545 13.615
## Burkina Faso 16 0.545 14.159 0.545 14.159
## Burundi 16 0.545 14.704 0.545 14.704
## Cabo Verde 16 0.545 15.248 0.545 15.248
## Cambodia 16 0.545 15.793 0.545 15.793
## Cameroon 16 0.545 16.338 0.545 16.338
## Canada 16 0.545 16.882 0.545 16.882
## Central African Republic 16 0.545 17.427 0.545 17.427
## Chad 16 0.545 17.971 0.545 17.971
## Chile 16 0.545 18.516 0.545 18.516
## China 16 0.545 19.061 0.545 19.061
## Colombia 16 0.545 19.605 0.545 19.605
## Comoros 16 0.545 20.150 0.545 20.150
## Congo 16 0.545 20.694 0.545 20.694
## Cook Islands 1 0.034 20.728 0.034 20.728
## Costa Rica 16 0.545 21.273 0.545 21.273
## Côte d'Ivoire 16 0.545 21.818 0.545 21.818
## Croatia 16 0.545 22.362 0.545 22.362
## Cuba 16 0.545 22.907 0.545 22.907
## Cyprus 16 0.545 23.451 0.545 23.451
## Czechia 16 0.545 23.996 0.545 23.996
## Democratic People's Republic of Korea 16 0.545 24.541 0.545 24.541
## Democratic Republic of the Congo 16 0.545 25.085 0.545 25.085
## Denmark 16 0.545 25.630 0.545 25.630
## Djibouti 16 0.545 26.174 0.545 26.174
## Dominica 1 0.034 26.208 0.034 26.208
## Dominican Republic 16 0.545 26.753 0.545 26.753
## Ecuador 16 0.545 27.297 0.545 27.297
## Egypt 16 0.545 27.842 0.545 27.842
## El Salvador 16 0.545 28.387 0.545 28.387
## Equatorial Guinea 16 0.545 28.931 0.545 28.931
## Eritrea 16 0.545 29.476 0.545 29.476
## Estonia 16 0.545 30.020 0.545 30.020
## Ethiopia 16 0.545 30.565 0.545 30.565
## Fiji 16 0.545 31.110 0.545 31.110
## Finland 16 0.545 31.654 0.545 31.654
## France 16 0.545 32.199 0.545 32.199
## Gabon 16 0.545 32.743 0.545 32.743
## Gambia 16 0.545 33.288 0.545 33.288
## Georgia 16 0.545 33.833 0.545 33.833
## Germany 16 0.545 34.377 0.545 34.377
## Ghana 16 0.545 34.922 0.545 34.922
## Greece 16 0.545 35.466 0.545 35.466
## Grenada 16 0.545 36.011 0.545 36.011
## Guatemala 16 0.545 36.555 0.545 36.555
## Guinea 16 0.545 37.100 0.545 37.100
## Guinea-Bissau 16 0.545 37.645 0.545 37.645
## Guyana 16 0.545 38.189 0.545 38.189
## Haiti 16 0.545 38.734 0.545 38.734
## Honduras 16 0.545 39.278 0.545 39.278
## Hungary 16 0.545 39.823 0.545 39.823
## Iceland 16 0.545 40.368 0.545 40.368
## India 16 0.545 40.912 0.545 40.912
## Indonesia 16 0.545 41.457 0.545 41.457
## Iran (Islamic Republic of) 16 0.545 42.001 0.545 42.001
## Iraq 16 0.545 42.546 0.545 42.546
## Ireland 16 0.545 43.091 0.545 43.091
## Israel 16 0.545 43.635 0.545 43.635
## Italy 16 0.545 44.180 0.545 44.180
## Jamaica 16 0.545 44.724 0.545 44.724
## Japan 16 0.545 45.269 0.545 45.269
## Jordan 16 0.545 45.813 0.545 45.813
## Kazakhstan 16 0.545 46.358 0.545 46.358
## Kenya 16 0.545 46.903 0.545 46.903
## Kiribati 16 0.545 47.447 0.545 47.447
## Kuwait 16 0.545 47.992 0.545 47.992
## Kyrgyzstan 16 0.545 48.536 0.545 48.536
## Lao People's Democratic Republic 16 0.545 49.081 0.545 49.081
## Latvia 16 0.545 49.626 0.545 49.626
## Lebanon 16 0.545 50.170 0.545 50.170
## Lesotho 16 0.545 50.715 0.545 50.715
## Liberia 16 0.545 51.259 0.545 51.259
## Libya 16 0.545 51.804 0.545 51.804
## Lithuania 16 0.545 52.349 0.545 52.349
## Luxembourg 16 0.545 52.893 0.545 52.893
## Madagascar 16 0.545 53.438 0.545 53.438
## Malawi 16 0.545 53.982 0.545 53.982
## Malaysia 16 0.545 54.527 0.545 54.527
## Maldives 16 0.545 55.071 0.545 55.071
## Mali 16 0.545 55.616 0.545 55.616
## Malta 16 0.545 56.161 0.545 56.161
## Marshall Islands 1 0.034 56.195 0.034 56.195
## Mauritania 16 0.545 56.739 0.545 56.739
## Mauritius 16 0.545 57.284 0.545 57.284
## Mexico 16 0.545 57.828 0.545 57.828
## Micronesia (Federated States of) 16 0.545 58.373 0.545 58.373
## Monaco 1 0.034 58.407 0.034 58.407
## Mongolia 16 0.545 58.952 0.545 58.952
## Montenegro 16 0.545 59.496 0.545 59.496
## Morocco 16 0.545 60.041 0.545 60.041
## Mozambique 16 0.545 60.585 0.545 60.585
## Myanmar 16 0.545 61.130 0.545 61.130
## Namibia 16 0.545 61.675 0.545 61.675
## Nauru 1 0.034 61.709 0.034 61.709
## Nepal 16 0.545 62.253 0.545 62.253
## Netherlands 16 0.545 62.798 0.545 62.798
## New Zealand 16 0.545 63.342 0.545 63.342
## Nicaragua 16 0.545 63.887 0.545 63.887
## Niger 16 0.545 64.432 0.545 64.432
## Nigeria 16 0.545 64.976 0.545 64.976
## Niue 1 0.034 65.010 0.034 65.010
## Norway 16 0.545 65.555 0.545 65.555
## Oman 16 0.545 66.099 0.545 66.099
## Pakistan 16 0.545 66.644 0.545 66.644
## Palau 1 0.034 66.678 0.034 66.678
## Panama 16 0.545 67.223 0.545 67.223
## Papua New Guinea 16 0.545 67.767 0.545 67.767
## Paraguay 16 0.545 68.312 0.545 68.312
## Peru 16 0.545 68.856 0.545 68.856
## Philippines 16 0.545 69.401 0.545 69.401
## Poland 16 0.545 69.946 0.545 69.946
## Portugal 16 0.545 70.490 0.545 70.490
## Qatar 16 0.545 71.035 0.545 71.035
## Republic of Korea 16 0.545 71.579 0.545 71.579
## Republic of Moldova 16 0.545 72.124 0.545 72.124
## Romania 16 0.545 72.668 0.545 72.668
## Russian Federation 16 0.545 73.213 0.545 73.213
## Rwanda 16 0.545 73.758 0.545 73.758
## Saint Kitts and Nevis 1 0.034 73.792 0.034 73.792
## Saint Lucia 16 0.545 74.336 0.545 74.336
## Saint Vincent and the Grenadines 16 0.545 74.881 0.545 74.881
## Samoa 16 0.545 75.425 0.545 75.425
## San Marino 1 0.034 75.459 0.034 75.459
## Sao Tome and Principe 16 0.545 76.004 0.545 76.004
## Saudi Arabia 16 0.545 76.549 0.545 76.549
## Senegal 16 0.545 77.093 0.545 77.093
## Serbia 16 0.545 77.638 0.545 77.638
## Seychelles 16 0.545 78.182 0.545 78.182
## Sierra Leone 16 0.545 78.727 0.545 78.727
## Singapore 16 0.545 79.272 0.545 79.272
## Slovakia 16 0.545 79.816 0.545 79.816
## Slovenia 16 0.545 80.361 0.545 80.361
## Solomon Islands 16 0.545 80.905 0.545 80.905
## Somalia 16 0.545 81.450 0.545 81.450
## South Africa 16 0.545 81.995 0.545 81.995
## South Sudan 16 0.545 82.539 0.545 82.539
## Spain 16 0.545 83.084 0.545 83.084
## Sri Lanka 16 0.545 83.628 0.545 83.628
## Sudan 16 0.545 84.173 0.545 84.173
## Suriname 16 0.545 84.717 0.545 84.717
## Swaziland 16 0.545 85.262 0.545 85.262
## Sweden 16 0.545 85.807 0.545 85.807
## Switzerland 16 0.545 86.351 0.545 86.351
## Syrian Arab Republic 16 0.545 86.896 0.545 86.896
## Tajikistan 16 0.545 87.440 0.545 87.440
## Thailand 16 0.545 87.985 0.545 87.985
## The former Yugoslav republic of Macedonia 16 0.545 88.530 0.545 88.530
## Timor-Leste 16 0.545 89.074 0.545 89.074
## Togo 16 0.545 89.619 0.545 89.619
## Tonga 16 0.545 90.163 0.545 90.163
## Trinidad and Tobago 16 0.545 90.708 0.545 90.708
## Tunisia 16 0.545 91.253 0.545 91.253
## Turkey 16 0.545 91.797 0.545 91.797
## Turkmenistan 16 0.545 92.342 0.545 92.342
## Tuvalu 1 0.034 92.376 0.034 92.376
## Uganda 16 0.545 92.920 0.545 92.920
## Ukraine 16 0.545 93.465 0.545 93.465
## United Arab Emirates 16 0.545 94.010 0.545 94.010
## United Kingdom of Great Britain and Northern Ireland 16 0.545 94.554 0.545 94.554
## United Republic of Tanzania 16 0.545 95.099 0.545 95.099
## United States of America 16 0.545 95.643 0.545 95.643
## Uruguay 16 0.545 96.188 0.545 96.188
## Uzbekistan 16 0.545 96.732 0.545 96.732
## Vanuatu 16 0.545 97.277 0.545 97.277
## Venezuela (Bolivarian Republic of) 16 0.545 97.822 0.545 97.822
## Viet Nam 16 0.545 98.366 0.545 98.366
## Yemen 16 0.545 98.911 0.545 98.911
## Zambia 16 0.545 99.455 0.545 99.455
## Zimbabwe 16 0.545 100.000 0.545 100.000
## <NA> 0 0.000 100.000
## Total 2938 100.000 100.000 100.000 100.000
freq(expect_imputed$Status)
## Frequencies
## expect_imputed$Status
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
## Developed 512 17.43 17.43 17.43 17.43
## Developing 2426 82.57 100.00 82.57 100.00
## <NA> 0 0.00 100.00
## Total 2938 100.00 100.00 100.00 100.00
In the variable Country we can observe that a smaller sample has been taken for some countries: Cook Islands, Dominica, Marshall Islands, Monaco, Nauru, Niue, Palau, Saint Kitts and Nevis, San Marino and Tuvalu. This is because these countries have a smaller population.
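This can be checked directly; the following sketch tabulates the number of records per country and lists those with fewer than the usual 16 yearly observations:

```r
#Sketch: countries with fewer than 16 records in the data set.
counts <- table(expect_imputed$Country)
names(counts[counts < 16])
```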
To detect outliers, a preliminary exploratory analysis is carried out by constructing boxplots for all variables. The following visualization is not ideal because of the differences in scale between variables.
boxplot(expect_num[-1],main="Outliers",
xlab="All explanatory variables",
ylab="Values",
col=c(1:11),
las = 2)
The following visualization is not affected by the differences in scale: once the quantitative variables are standardized, the effect of the scale differences disappears.
sca<-scale(expect_num[-1])
boxplot(sca,main="Outliers",
xlab="All explanatory variables",
ylab="Values",
col=c(1:11))
As we can see there are many outliers, so we cannot simply eliminate them; we will substitute them with the mean instead.
Since all variables are quantitative, the decision has been made to replace the outliers with the mean. To achieve this, the 'outlier' function has been developed to detect and handle them. It is worth noting that, in practice, replacing outliers with the mean would necessitate a prior analysis of the cause of these outlying values.
#Defining the outlier function.
outlier<-function(data){
  H<-1.5*IQR(data, na.rm = TRUE)
  data[data<quantile(data,0.25, na.rm = TRUE)-H]<-NA
  data[data>quantile(data,0.75, na.rm = TRUE)+H]<-NA
  data[is.na(data)]<-mean(data, na.rm = TRUE)
  H<-1.5*IQR(data)
  #Recurse until no value lies outside the fences of the updated data.
  if (any(data<quantile(data,0.25)-H) |
      any(data>quantile(data,0.75)+H))
    outlier(data)
  else
    return(data)
}
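Before applying it, it can be useful to count how many observations fall outside the 1.5*IQR fences of each variable; this is a sketch using a hypothetical helper count_outliers, not part of the analysis itself:

```r
#Sketch: number of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per variable.
count_outliers <- function(x) {
  H <- 1.5 * IQR(x, na.rm = TRUE)
  sum(x < quantile(x, 0.25, na.rm = TRUE) - H |
      x > quantile(x, 0.75, na.rm = TRUE) + H, na.rm = TRUE)
}
sort(apply(expect_num, 2, count_outliers), decreasing = TRUE)
```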
# Application of the outlier function to substitute the outliers for the mean.
data<-as.data.frame(apply(expect_num,2,outlier))
# Lastly, a comparison between the original data and the corrected data with the respective boxplots.
par(mfrow=c(1,2))
# Boxplot of the original data.
boxplot(expect_num,main="Original data",
xlab="Explanatory variables",
ylab="Values",
col=c(1:ncol(expect_num)))
# Boxplot of the corrected data.
boxplot(data,main="Data without outliers",
xlab="Explanatory variables",
ylab="Values",
col=c(1:ncol(data)))
As we can see, the data are not scaled, so now we'll look at the boxplots with scaled data.
#We apply the function to substitute the outliers.
data_norm<-as.data.frame(apply(sca,2,outlier))
#Lastly, a comparison between the original data and the corrected data with the respective boxplots.
par(mfrow=c(1,2))
# Boxplot of the original data.
boxplot(sca,main="Original data",
xlab="Explanatory variables",
ylab="Values",
col=c(1:ncol(expect_num)))
# Boxplot of the corrected data.
boxplot(data_norm,main="Data without outliers",
xlab="Explanatory variables",
ylab="Values",
col=c(1:ncol(expect_num)))
First, it is necessary to verify that the variables are not independent:
-At the level of the sample collected in the database, this can be done by calculating and inspecting the correlation matrix.
-At the population level, the existence of correlation will be justified using Bartlett's Test of Sphericity. This test checks whether the correlations are significantly different from 0. The null hypothesis is H_0: det(R) = 1, where R denotes the correlation matrix; it means that the variables are uncorrelated. This test works with standardized data.
#Correlation matrix.
correlation_matrix<- cor(data)
correlation_matrix
## Year Life.expectancy Adult.Mortality
## Year 1.00000000 0.15250262 -0.01657843
## Life.expectancy 0.15250262 1.00000000 -0.57411530
## Adult.Mortality -0.01657843 -0.57411530 1.00000000
## infant.deaths -0.03193416 -0.50566581 0.33970610
## Alcohol -0.04925409 0.39542286 -0.21055256
## percentage.expenditure -0.04302649 0.30478901 -0.20762533
## Hepatitis.B 0.18727322 0.32680387 -0.20138411
## Measles -0.05629980 -0.24356404 0.13748341
## BMI 0.10832670 0.55644633 -0.36084781
## under.five.deaths -0.02627466 -0.44562526 0.31075365
## Polio 0.08075056 0.40103560 -0.25089144
## Total.expenditure 0.04554965 0.22408065 -0.16141448
## Diphtheria 0.09997122 0.40414736 -0.26518865
## HIV.AIDS -0.04205825 -0.66178696 0.40502939
## GDP 0.12427861 0.29125683 -0.12147221
## Population 0.01953851 -0.01060332 0.03390156
## thinness..1.19.years -0.04775821 -0.60932003 0.35890225
## thinness.5.9.years -0.03322319 -0.60668780 0.36984153
## Resource.Income 0.12855710 0.81448108 -0.49792606
## Schooling 0.13954171 0.67848169 -0.40594116
## infant.deaths Alcohol percentage.expenditure
## Year -0.03193416 -0.04925409 -0.043026485
## Life.expectancy -0.50566581 0.39542286 0.304789009
## Adult.Mortality 0.33970610 -0.21055256 -0.207625335
## infant.deaths 1.00000000 -0.33964546 -0.269115025
## Alcohol -0.33964546 1.00000000 0.189363467
## percentage.expenditure -0.26911502 0.18936347 1.000000000
## Hepatitis.B -0.28981084 0.11861959 0.129193575
## Measles 0.38129436 -0.13953262 -0.148422850
## BMI -0.40463100 0.31656838 0.222071629
## under.five.deaths 0.69124728 -0.31255232 -0.212785308
## Polio -0.28412162 0.22006671 0.149810964
## Total.expenditure -0.22908354 0.29892302 0.107914633
## Diphtheria -0.29540580 0.22986006 0.154403667
## HIV.AIDS 0.38581024 -0.18461498 -0.184295246
## GDP -0.15742062 0.24849679 -0.001114271
## Population 0.19347234 -0.06955884 -0.184717447
## thinness..1.19.years 0.35012332 -0.41234498 -0.224947396
## thinness.5.9.years 0.36226862 -0.40712881 -0.226128862
## Resource.Income -0.50987566 0.49854360 0.360029134
## Schooling -0.46652459 0.50180812 0.304087449
## Hepatitis.B Measles BMI under.five.deaths
## Year 0.18727322 -0.05629980 0.10832670 -0.02627466
## Life.expectancy 0.32680387 -0.24356404 0.55644633 -0.44562526
## Adult.Mortality -0.20138411 0.13748341 -0.36084781 0.31075365
## infant.deaths -0.28981084 0.38129436 -0.40463100 0.69124728
## Alcohol 0.11861959 -0.13953262 0.31656838 -0.31255232
## percentage.expenditure 0.12919358 -0.14842285 0.22207163 -0.21278531
## Hepatitis.B 1.00000000 -0.15603801 0.21003355 -0.21589883
## Measles -0.15603801 1.00000000 -0.23745465 0.31577327
## BMI 0.21003355 -0.23745465 1.00000000 -0.32887835
## under.five.deaths -0.21589883 0.31577327 -0.32887835 1.00000000
## Polio 0.49788803 -0.12400568 0.21561855 -0.24101112
## Total.expenditure 0.02487489 -0.14539433 0.20785582 -0.16825635
## Diphtheria 0.50428189 -0.12729666 0.21950700 -0.24348076
## HIV.AIDS -0.22736643 0.17223318 -0.43900559 0.35721178
## GDP 0.18512152 -0.08828064 0.25075566 -0.17347086
## Population 0.03654473 0.12717832 -0.01934333 0.13498898
## thinness..1.19.years -0.12252064 0.22938223 -0.51910743 0.32586046
## thinness.5.9.years -0.13242165 0.24123086 -0.52830322 0.32985369
## Resource.Income 0.32968635 -0.23665929 0.58175142 -0.46813662
## Schooling 0.28325557 -0.21253215 0.51536836 -0.41267225
## Polio Total.expenditure Diphtheria HIV.AIDS
## Year 0.08075056 0.04554965 0.09997122 -0.04205825
## Life.expectancy 0.40103560 0.22408065 0.40414736 -0.66178696
## Adult.Mortality -0.25089144 -0.16141448 -0.26518865 0.40502939
## infant.deaths -0.28412162 -0.22908354 -0.29540580 0.38581024
## Alcohol 0.22006671 0.29892302 0.22986006 -0.18461498
## percentage.expenditure 0.14981096 0.10791463 0.15440367 -0.18429525
## Hepatitis.B 0.49788803 0.02487489 0.50428189 -0.22736643
## Measles -0.12400568 -0.14539433 -0.12729666 0.17223318
## BMI 0.21561855 0.20785582 0.21950700 -0.43900559
## under.five.deaths -0.24101112 -0.16825635 -0.24348076 0.35721178
## Polio 1.00000000 0.08557358 0.83402939 -0.34260272
## Total.expenditure 0.08557358 1.00000000 0.10029069 -0.11642070
## Diphtheria 0.83402939 0.10029069 1.00000000 -0.34559476
## HIV.AIDS -0.34260272 -0.11642070 -0.34559476 1.00000000
## GDP 0.18933416 0.08210680 0.18731256 -0.25862973
## Population 0.01198798 -0.14634868 0.03148769 -0.03028617
## thinness..1.19.years -0.17506796 -0.31176037 -0.19632835 0.44316357
## thinness.5.9.years -0.19532547 -0.32077192 -0.21411901 0.42309886
## Resource.Income 0.39473106 0.18272611 0.39137309 -0.56461340
## Schooling 0.37405864 0.25232368 0.36831656 -0.46831628
## GDP Population thinness..1.19.years
## Year 0.124278608 0.01953851 -0.04775821
## Life.expectancy 0.291256832 -0.01060332 -0.60932003
## Adult.Mortality -0.121472214 0.03390156 0.35890225
## infant.deaths -0.157420621 0.19347234 0.35012332
## Alcohol 0.248496790 -0.06955884 -0.41234498
## percentage.expenditure -0.001114271 -0.18471745 -0.22494740
## Hepatitis.B 0.185121522 0.03654473 -0.12252064
## Measles -0.088280636 0.12717832 0.22938223
## BMI 0.250755657 -0.01934333 -0.51910743
## under.five.deaths -0.173470856 0.13498898 0.32586046
## Polio 0.189334157 0.01198798 -0.17506796
## Total.expenditure 0.082106795 -0.14634868 -0.31176037
## Diphtheria 0.187312558 0.03148769 -0.19632835
## HIV.AIDS -0.258629727 -0.03028617 0.44316357
## GDP 1.000000000 0.28904708 -0.19123381
## Population 0.289047078 1.00000000 0.05586769
## thinness..1.19.years -0.191233814 0.05586769 1.00000000
## thinness.5.9.years -0.192708380 0.06125504 0.92154176
## Resource.Income 0.362835252 -0.03309412 -0.56734810
## Schooling 0.314214831 -0.05425340 -0.50681145
## thinness.5.9.years Resource.Income Schooling
## Year -0.03322319 0.12855710 0.1395417
## Life.expectancy -0.60668780 0.81448108 0.6784817
## Adult.Mortality 0.36984153 -0.49792606 -0.4059412
## infant.deaths 0.36226862 -0.50987566 -0.4665246
## Alcohol -0.40712881 0.49854360 0.5018081
## percentage.expenditure -0.22612886 0.36002913 0.3040874
## Hepatitis.B -0.13242165 0.32968635 0.2832556
## Measles 0.24123086 -0.23665929 -0.2125321
## BMI -0.52830322 0.58175142 0.5153684
## under.five.deaths 0.32985369 -0.46813662 -0.4126722
## Polio -0.19532547 0.39473106 0.3740586
## Total.expenditure -0.32077192 0.18272611 0.2523237
## Diphtheria -0.21411901 0.39137309 0.3683166
## HIV.AIDS 0.42309886 -0.56461340 -0.4683163
## GDP -0.19270838 0.36283525 0.3142148
## Population 0.06125504 -0.03309412 -0.0542534
## thinness..1.19.years 0.92154176 -0.56734810 -0.5068114
## thinness.5.9.years 1.00000000 -0.56094672 -0.5065113
## Resource.Income -0.56094672 1.00000000 0.8229748
## Schooling -0.50651129 0.82297477 1.0000000
#Standardized data.
data_scaled<- scale(data)
#Bartlett's Test of sphericity.
cortest.bartlett(cor(data_scaled))
## $chisq
## [1] 961.4059
##
## $p.value
## [1] 2.441366e-103
##
## $df
## [1] 190
$chisq: the chi-square statistic of Bartlett's test of sphericity. It measures the discrepancy between the observed correlation matrix and an identity matrix, i.e., a matrix in which all variables are uncorrelated; a large value therefore indicates that the variables are correlated.
$p.value: the p-value associated with the chi-square statistic, i.e., the probability of obtaining a statistic at least as extreme as the one calculated if the variables were truly uncorrelated. Since the p-value is extremely small, it provides strong evidence against the null hypothesis that the variables are uncorrelated (det(R) = 1). The null hypothesis is rejected, so we conclude det(R) != 1.
$df: the degrees of freedom of the chi-square statistic; for p = 20 variables, df = p(p-1)/2 = 190.
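As a sanity check, Bartlett's statistic can be reproduced by hand. The following sketch (base R only, on hypothetical simulated data) applies the formula behind the test: chisq = -(n - 1 - (2p + 5)/6) * log(det(R)), with df = p(p - 1)/2.

```r
#Bartlett's sphericity statistic computed by hand on simulated data.
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), ncol = p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.3) #induce one strong correlation
R <- cor(X)
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df <- p * (p - 1) / 2
p.value <- pchisq(chisq, df, lower.tail = FALSE)
c(chisq = chisq, df = df, p.value = p.value)
```

For our dataset p = 20, so df = 20*19/2 = 190, which matches the output above.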
After this, we study the possibility of reducing the dimension of the set of observable variables. It is convenient to choose the optimal number of principal components using different graphical techniques.
Firstly, we’ll study the polychoric correlation matrix.
poly_cor<- hetcor(expect_num)$correlations
ggcorrplot(poly_cor, type="lower",hc.order=T, tl.cex = 10)
A polychoric correlation matrix quantifies the relationship between pairs of variables and provides a measure of how they tend to move together or in opposite directions (hetcor computes Pearson correlations for numeric pairs and polychoric/polyserial correlations where ordinal variables are involved). Polychoric correlation values range from -1 to 1.
A positive value indicates a positive relationship: as the values of one variable increase, the values of the other tend to increase. Conversely, a negative value indicates a negative relationship, where an increase in one variable is associated with a decrease in the other. A value close to 0 indicates a weak relationship, and the magnitude of the value reflects the strength of the association. In the plot, the redder a cell, the stronger the positive correlation between the two variables; the bluer, the stronger the negative correlation; colors close to white indicate a weak relationship.
#Another interesting representation would be the following one.
corrplot(correlation_matrix, order = "hclust", tl.col='black', tl.cex=0.7)
This is a graphical representation of the correlation matrix. The Pearson correlation matrix is computed for the data in our dataset, meaning pairwise correlations are calculated between all variables in the dataset, yielding a square correlation matrix. Positive correlations are depicted by an intense blue color, which is why the diagonal of the matrix is blue since all variables are entirely correlated with themselves. Negative correlations are indicated by an intense red color.
Observing the graphical representations, it appears that there are between 4 and 5 groups of latent variables highly correlated with one group of observable variables and weakly correlated with the rest. This provides an intuitive sense of the optimal number of factors for our Factor Analysis.
Based on the results obtained in the previous sections, we are ready to perform dimensionality reduction using Principal Component Analysis (PCA) or Factor Analysis, as the data is correlated, contains no missing values, and the presence of outliers has been avoided.
The following code performs PCA, obtaining the eigenvectors that generate each component, as well as their eigenvalues, which correspond to the variance of each component.
#Using prcomp to compute the principal components (eigenvalues and eigenvectors).
#With scale=TRUE, variable means are set to zero, and variances set to one.
expect_pca <- prcomp(data,scale=TRUE)
#The "rotation" field of the "expect_pca" object is a matrix whose columns
#are the coefficients of the principal components, i.e., the
#weight of each variable in the corresponding principal component.
expect_pca$rotation
## PC1 PC2 PC3 PC4
## Year -0.05039560 0.1635392780 -0.11615031 -0.43419326
## Life.expectancy -0.33036601 -0.0007331087 -0.10183706 0.03802842
## Adult.Mortality 0.22223435 0.0169965057 0.01445102 -0.14470781
## infant.deaths 0.25549725 0.0547922822 -0.32417394 0.30325989
## Alcohol -0.21116083 -0.1321737482 -0.03421661 0.11471888
## percentage.expenditure -0.14737094 -0.0988024316 0.26515739 0.23292234
## Hepatitis.B -0.16377588 0.4225957589 0.13271579 -0.06245054
## Measles 0.14037039 0.0886268312 -0.27396652 0.45518813
## BMI -0.25820730 -0.1146238362 -0.13527631 -0.07385094
## under.five.deaths 0.23095279 0.0602532507 -0.27424724 0.34225124
## Polio -0.20190862 0.4786971839 0.14835802 0.24737499
## Total.expenditure -0.12954884 -0.2216846278 0.06196571 0.06779824
## Diphtheria -0.20545590 0.4743879267 0.13889388 0.24203717
## HIV.AIDS 0.25390036 -0.0545466009 0.11835201 -0.02977423
## GDP -0.14225915 0.1572181191 -0.40731197 -0.32399638
## Population 0.02898727 0.2360778373 -0.55146002 -0.12839398
## thinness..1.19.years 0.27376956 0.2798402880 0.19574895 -0.15381806
## thinness.5.9.years 0.27546647 0.2725863602 0.18013825 -0.15884438
## Resource.Income -0.33351334 -0.0013924645 -0.09152134 0.01037712
## Schooling -0.30567926 -0.0118349625 -0.06982695 0.01750321
## PC5 PC6 PC7 PC8
## Year 0.29718445 0.661637939 -0.28330347 -0.10291026
## Life.expectancy -0.15950003 0.132478949 -0.02364510 -0.09667275
## Adult.Mortality 0.31446611 -0.137454065 -0.07377528 0.38922069
## infant.deaths 0.09678651 0.181791990 -0.04174313 0.16406518
## Alcohol 0.34299350 -0.412901663 -0.28134847 -0.02628749
## percentage.expenditure -0.19055406 0.049064971 -0.51180870 0.49353017
## Hepatitis.B 0.09188185 0.111336717 0.02851010 0.15639876
## Measles -0.04983325 0.024776285 -0.34206666 -0.48582766
## BMI -0.06204044 0.126781094 0.01567170 0.10354635
## under.five.deaths 0.14827527 0.246299336 -0.04586746 0.24357189
## Polio 0.14983740 -0.048873652 0.14190817 0.02652263
## Total.expenditure 0.60195405 -0.040231404 0.09841795 -0.30030032
## Diphtheria 0.16361939 -0.036229926 0.15681018 0.02834705
## HIV.AIDS 0.32842048 -0.090106059 -0.18405402 0.14697691
## GDP 0.09800907 -0.339765315 -0.18328085 0.06966947
## Population -0.12718281 -0.250535195 0.13229450 0.15466548
## thinness..1.19.years -0.11563399 -0.142980560 -0.26641079 -0.20017103
## thinness.5.9.years -0.12501773 -0.131968640 -0.28418414 -0.20589095
## Resource.Income -0.10908327 0.002337521 -0.25590327 -0.01027066
## Schooling 0.03401024 -0.024270068 -0.30677835 -0.04714555
## PC9 PC10 PC11 PC12
## Year 0.05936855 -0.079401382 0.023821523 0.298831272
## Life.expectancy -0.03705585 0.076479408 -0.020004876 0.077097634
## Adult.Mortality 0.37616897 -0.296412110 -0.476945044 -0.025575316
## infant.deaths -0.06406277 0.264736129 -0.036675577 0.015382992
## Alcohol 0.26298836 0.221702937 0.256656505 0.135673548
## percentage.expenditure -0.39752397 -0.299278076 -0.061495628 0.121803736
## Hepatitis.B 0.01196103 -0.196458854 0.397705360 -0.564135424
## Measles 0.21863596 -0.493725129 -0.035019466 -0.165005312
## BMI 0.03448427 0.032832434 -0.039788432 -0.500995452
## under.five.deaths -0.15219049 0.386971809 -0.038094806 -0.171131599
## Polio 0.04559221 0.051736363 -0.149454886 0.156649052
## Total.expenditure -0.59883409 -0.150285934 -0.121504583 -0.091403888
## Diphtheria 0.02638437 0.014306985 -0.108688002 0.184280295
## HIV.AIDS 0.07103142 0.006480065 0.538396775 -0.008838757
## GDP -0.15123464 0.067912604 -0.300102739 -0.247618245
## Population -0.27691658 -0.290513699 0.317218284 0.325544622
## thinness..1.19.years -0.19040089 0.195483658 -0.054497180 -0.070575560
## thinness.5.9.years -0.17581079 0.190400672 -0.074237654 -0.051466917
## Resource.Income 0.08183350 0.161231231 0.004636026 0.016336700
## Schooling 0.10087218 0.207399033 -0.012251106 0.052363638
## PC13 PC14 PC15 PC16
## Year 0.01225355 0.09607309 0.188071502 0.01947916
## Life.expectancy 0.01406141 -0.19354098 0.003276960 0.08889360
## Adult.Mortality -0.26254004 -0.27504463 -0.094641078 0.01265549
## infant.deaths 0.06826052 -0.04982975 0.007183562 -0.75572983
## Alcohol -0.07240371 -0.09273099 0.579762201 -0.01734889
## percentage.expenditure 0.06737358 0.09280136 0.125840916 -0.02242875
## Hepatitis.B 0.14526661 -0.41574455 0.073587810 -0.09980227
## Measles 0.02898145 0.08914250 -0.012434305 0.07091039
## BMI -0.59019078 0.48155436 0.131824238 -0.10234442
## under.five.deaths -0.05238575 -0.12418083 0.059860753 0.61419849
## Polio -0.04526815 0.21966450 -0.064415795 -0.01890306
## Total.expenditure -0.13147338 -0.14267944 -0.087935911 -0.06140062
## Diphtheria -0.04112841 0.22689628 -0.004193625 0.03762600
## HIV.AIDS 0.04982527 0.39639881 -0.433987090 0.06138024
## GDP 0.51860951 0.25930036 -0.023012315 0.04320723
## Population -0.34796672 -0.09606712 -0.013910461 0.04341353
## thinness..1.19.years -0.21692368 -0.02034840 0.037180148 -0.01845632
## thinness.5.9.years -0.21564206 -0.06987844 0.074687436 -0.03907094
## Resource.Income -0.06191748 -0.15281685 -0.267838242 -0.03080529
## Schooling -0.17982142 -0.21799842 -0.545392981 -0.03233309
## PC17 PC18 PC19 PC20
## Year -0.070556053 -0.0292078317 0.028457414 -1.724463e-02
## Life.expectancy 0.738177883 0.0356133583 -0.469071545 2.840843e-02
## Adult.Mortality 0.204128461 0.0284589990 -0.025531212 -1.173581e-02
## infant.deaths 0.079047447 0.0386073739 -0.040013419 -1.245200e-02
## Alcohol -0.029213798 -0.0242018304 -0.033141027 -1.518932e-02
## percentage.expenditure -0.046416878 -0.0148282696 -0.040558753 3.722456e-05
## Hepatitis.B -0.047531764 -0.0071804854 -0.013217699 -7.779322e-04
## Measles 0.002696099 -0.0036986803 -0.003633262 -5.436664e-03
## BMI 0.017229725 0.0074788883 -0.043872587 1.872838e-02
## under.five.deaths -0.043080910 -0.0189290410 0.041804613 2.524983e-03
## Polio 0.054642421 -0.7004222293 0.026961238 2.127814e-02
## Total.expenditure 0.057638804 -0.0007739644 0.070895675 1.560641e-03
## Diphtheria -0.036664271 0.7062234145 -0.002644184 3.299977e-03
## HIV.AIDS 0.305711419 0.0192128199 -0.062335140 4.159963e-02
## GDP 0.002753927 0.0089570525 -0.054900541 1.117356e-02
## Population -0.032138654 -0.0299204213 0.021237575 -3.515843e-03
## thinness..1.19.years 0.083182767 0.0081735814 -0.069454869 -7.046161e-01
## thinness.5.9.years 0.072003239 0.0272959352 0.028708265 7.038266e-01
## Resource.Income 0.248607250 0.0527229940 0.775987881 -5.648313e-02
## Schooling -0.467663707 -0.0036697603 -0.386509071 2.459497e-02
#In the "sdev" field of the "expect_pca" object, along with the summary function applied
#to the object, relevant information is obtained: standard deviations of
#each principal component, proportion of explained and cumulative variance.
expect_pca$sdev
## [1] 2.6232180 1.3778737 1.2203912 1.0552838 1.0334747 1.0127769 0.9493830
## [8] 0.8955225 0.8483023 0.7908919 0.7866141 0.7538461 0.7298716 0.6969106
## [15] 0.6265453 0.5387641 0.4753598 0.4047081 0.3468292 0.2769408
summary(expect_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6232 1.37787 1.22039 1.05528 1.0335 1.01278 0.94938
## Proportion of Variance 0.3441 0.09493 0.07447 0.05568 0.0534 0.05129 0.04507
## Cumulative Proportion 0.3441 0.43899 0.51346 0.56914 0.6225 0.67383 0.71890
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.8955 0.84830 0.79089 0.78661 0.75385 0.72987 0.69691
## Proportion of Variance 0.0401 0.03598 0.03128 0.03094 0.02841 0.02664 0.02428
## Cumulative Proportion 0.7590 0.79497 0.82625 0.85719 0.88560 0.91224 0.93652
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 0.62655 0.53876 0.4754 0.40471 0.34683 0.27694
## Proportion of Variance 0.01963 0.01451 0.0113 0.00819 0.00601 0.00383
## Cumulative Proportion 0.95615 0.97066 0.9820 0.99015 0.99617 1.00000
The following code computes the eigenvalues (variances) of the principal components, their proportions of explained variance, and the cumulative proportions.
eigen_expect <- expect_pca$sdev^2
names(eigen_expect) <- paste("PC",1:20,sep="")
eigen_expect
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 6.88127260 1.89853604 1.48935464 1.11362380 1.06806999 1.02571703 0.90132806
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## 0.80196053 0.71961676 0.62551002 0.61876173 0.56828399 0.53271250 0.48568438
## PC15 PC16 PC17 PC18 PC19 PC20
## 0.39255898 0.29026670 0.22596694 0.16378861 0.12029050 0.07669621
sumlambdas <- sum(eigen_expect)
sumlambdas
## [1] 20
#shows the percent variance each principal component holds
propvar <- eigen_expect/sumlambdas
propvar
## PC1 PC2 PC3 PC4 PC5 PC6
## 0.344063630 0.094926802 0.074467732 0.055681190 0.053403500 0.051285851
## PC7 PC8 PC9 PC10 PC11 PC12
## 0.045066403 0.040098026 0.035980838 0.031275501 0.030938087 0.028414199
## PC13 PC14 PC15 PC16 PC17 PC18
## 0.026635625 0.024284219 0.019627949 0.014513335 0.011298347 0.008189431
## PC19 PC20
## 0.006014525 0.003834811
#shows the cumulative percent variance
percent <- cumsum(propvar)
percent
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.3440636 0.4389904 0.5134582 0.5691394 0.6225429 0.6738287 0.7188951 0.7589931
## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
## 0.7949740 0.8262495 0.8571876 0.8856018 0.9122374 0.9365216 0.9561496 0.9706629
## PC17 PC18 PC19 PC20
## 0.9819612 0.9901507 0.9961652 1.0000000
The following graphs explain the behaviour of the explained variance for each principal component and the behaviour of the cumulative explained variance.
#Proportion of the explained variance.
expl_var <- expect_pca$sdev^2 / sum(expect_pca$sdev^2)
ggplot(data = data.frame(expl_var, pc = 1:ncol(data)),
aes(x = pc, y = expl_var, fill=expl_var )) +
geom_col(width = 0.3) +
scale_y_continuous(limits = c(0,0.6)) + theme_bw() +
labs(x = "Principal component", y = "Proportion of explained variance")
#Proportion of the cumulative explained variance.
var_cum<-cumsum(expl_var)
ggplot( data = data.frame(var_cum, pc = 1:ncol(data)),
aes(x = pc, y = var_cum ,fill=var_cum )) +
geom_col(width = 0.5) +
scale_y_continuous(limits = c(0,1)) +
theme_bw() +
labs(x = "Principal component", y = "Proportion of cumulative explained variance")
To choose the appropriate number of principal components, the following method is used: The variances explained by the principal components are averaged, and those whose proportion of explained variance exceeds the mean are selected.
expect_pca$sdev^2
## [1] 6.88127260 1.89853604 1.48935464 1.11362380 1.06806999 1.02571703
## [7] 0.90132806 0.80196053 0.71961676 0.62551002 0.61876173 0.56828399
## [13] 0.53271250 0.48568438 0.39255898 0.29026670 0.22596694 0.16378861
## [19] 0.12029050 0.07669621
mean(expect_pca$sdev^2)
## [1] 1
This analysis recommends taking 6 as the optimal number of principal components. We only need to retain enough components to explain a sufficiently large percentage of the total variance in the original variables. Typically, values between 70% and 90% are suggested, although lower values may be appropriate as the sample size increases. As this dataset has almost 3000 observations, it is safe to choose 6 principal components, which represent nearly 70% (67.38%) of the total variance. If a higher percentage is desired, one could opt for 7 components (PC7's proportion of explained variance is very close to the mean), raising the cumulative proportion to 71.89%.
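The selection rule just described can be expressed in one line. The sketch below re-enters the eigenvalues printed earlier (rounded) and keeps the components whose variance exceeds the mean eigenvalue, which equals 1 for standardized data:

```r
#Keep the components whose eigenvalue exceeds the mean eigenvalue.
eigen_expect <- c(6.881, 1.899, 1.489, 1.114, 1.068, 1.026, 0.901, 0.802,
                  0.720, 0.626, 0.619, 0.568, 0.533, 0.486, 0.393, 0.290,
                  0.226, 0.164, 0.120, 0.077)
which(eigen_expect > mean(eigen_expect))
## [1] 1 2 3 4 5 6
```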
Having chosen the optimal number of principal components, we now perform dimensionality reduction via factor analysis.
A method for extracting factors should be chosen, such as the principal factor, maximum likelihood, etc. The fa() function implements up to 6 different methods. This example compares the results of two methods.
First, two models with three factors each are created. The factor matrix for three latent factors in the two models is displayed.
#Two models with three factors.
#Model 1: factor analysis is performed in a polychoric correlation matrix.
modelo1<-fa(poly_cor,
nfactors = 3,
rotate = "none",
fm="mle") #maximum-likelihood method
#It is specified that three factors are extracted, and no rotation is applied to them. The maximum likelihood method is used for the estimation of these factors.
#Model 2.
modelo2<-fa(poly_cor,
nfactors = 3,
rotate = "none",
fm="minres") #minimal residual model
#The second model is analogous to the previous one, except for the last command, where the estimation of the factors is done using the minimum residuals model.
#Outputs of these models: factor weight matrices, etc. Blank entries represent null loadings.
print("Modelo 1: mle")
## [1] "Modelo 1: mle"
modelo1$loadings
##
## Loadings:
## ML2 ML1 ML3
## Year 0.197
## Life.expectancy 0.616 -0.232 0.645
## Adult.Mortality -0.445 0.102 -0.487
## infant.deaths 0.997
## Alcohol 0.477 -0.131 0.150
## percentage.expenditure 0.337 0.287
## Hepatitis.B -0.185 0.225
## Measles 0.505
## BMI 0.570 -0.254 0.220
## under.five.deaths 0.997
## Polio 0.263 -0.189 0.410
## Total.expenditure 0.265 -0.138
## Diphtheria 0.268 -0.195 0.424
## HIV.AIDS -0.322 -0.357
## GDP 0.358 -0.122 0.335
## Population 0.541
## thinness..1.19.years -0.799 0.500 0.212
## thinness.5.9.years -0.800 0.505 0.222
## Resource.Income 0.543 -0.172 0.541
## Schooling 0.562 -0.220 0.532
##
## ML2 ML1 ML3
## SS loadings 3.597 3.420 2.170
## Proportion Var 0.180 0.171 0.109
## Cumulative Var 0.180 0.351 0.459
print("Modelo 2: minres")
## [1] "Modelo 2: minres"
modelo2$loadings
##
## Loadings:
## MR1 MR2 MR3
## Year 0.169 0.101
## Life.expectancy 0.833 0.323
## Adult.Mortality -0.537 -0.276
## infant.deaths -0.519 0.819
## Alcohol 0.494 0.134 -0.199
## percentage.expenditure 0.466 0.234 -0.348
## Hepatitis.B 0.288 0.465
## Measles -0.298 0.382
## BMI 0.632
## under.five.deaths -0.533 0.800
## Polio 0.509 0.121 0.497
## Total.expenditure 0.297
## Diphtheria 0.531 0.126 0.579
## HIV.AIDS -0.346 -0.204
## GDP 0.511 0.246 -0.305
## Population -0.213 0.503
## thinness..1.19.years -0.717 0.216 0.262
## thinness.5.9.years -0.715 0.220 0.263
## Resource.Income 0.715 0.301
## Schooling 0.758 0.270
##
## MR1 MR2 MR3
## SS loadings 5.763 2.365 1.245
## Proportion Var 0.288 0.118 0.062
## Cumulative Var 0.288 0.406 0.469
Finally, the communalities obtained with the two methods are compared. The communalities of the maximum likelihood model (first column) appear to be greater than those of the minimum residual model (second column).
#Comparison of the communalities.
sort(modelo1$communality,decreasing = T)->c1
sort(modelo2$communality,decreasing = T)->c2
head(cbind(c1,c2))
## c1 c2
## infant.deaths 0.9959985 0.9405253
## under.five.deaths 0.9959976 0.9271596
## thinness.5.9.years 0.9437952 0.8060443
## thinness..1.19.years 0.9338352 0.6494882
## Life.expectancy 0.8494845 0.6326199
## Schooling 0.6472546 0.6292941
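A communality is simply the sum of a variable's squared loadings across the factors. As a quick check, the rounded loadings of Life.expectancy in modelo1 (0.616, -0.232, 0.645) approximately reproduce the value in the first column:

```r
#Communality = sum of squared loadings; uniqueness = 1 - communality.
loads <- c(0.616, -0.232, 0.645) #Life.expectancy on ML2, ML1, ML3 (rounded)
communality <- sum(loads^2)      #about 0.849, close to 0.8494845 above
uniqueness <- 1 - communality
communality
```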
#Comparison of uniqueness, that is, the proportion of variance that has not been explained by the factor (1-communality).
sort(modelo1$uniquenesses,decreasing = T)->u1
sort(modelo2$uniquenesses,decreasing = T)->u2
head(cbind(u1,u2),n=10)
## u1 u2
## Year 0.9517615 0.9554806
## Hepatitis.B 0.9099781 0.9035363
## Total.expenditure 0.9096144 0.8311416
## percentage.expenditure 0.7943096 0.7606215
## HIV.AIDS 0.7669905 0.7014155
## GDP 0.7452375 0.6996738
## Measles 0.7394481 0.6986724
## Alcohol 0.7329746 0.6322187
## Polio 0.7265124 0.6075658
## Diphtheria 0.7107326 0.5896661
There are different criteria for choosing the number of factors, among which the scree plot (Cattell, 1966) and parallel analysis (Horn, 1965) stand out. According to the following graphical outputs, parallel analysis considers 3 to be the optimal number of factors, whereas the scree plot suggests 6.
#Scree plot.
scree(poly_cor)
The plot includes the horizontal line y = 1 as a reference for random variance: eigenvalues of the correlation matrix that lie above it explain more variance than expected by chance. According to this scree plot, 6 would therefore be taken as the optimal number of factors.
#Parallel analysis.
fa.parallel(poly_cor,n.obs=100,fa="fa",fm="ml")
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
In this plot, the real data and simulated data with the model are represented by blue and red lines, respectively. It is observed that there are clearly 3 eigenvalues of the principal factors above the plot of simulated data. This indicates that 3 would be considered as the optimal number of factors.
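The idea behind parallel analysis can also be sketched by hand: compare the observed eigenvalues with the average eigenvalues of random data of the same dimensions, and retain the factors whose eigenvalue beats chance. A minimal base-R sketch on hypothetical simulated data (sizes chosen arbitrarily):

```r
#Hand-rolled parallel-analysis sketch on simulated data.
set.seed(42)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.5) #one genuinely correlated pair
obs <- eigen(cor(X))$values
#Average eigenvalues over 50 random data sets of the same shape.
sim <- replicate(50, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
keep <- sum(obs > rowMeans(sim)) #crude count of eigenvalues beating chance
keep
```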
A factor model with 5 factors, applying a varimax rotation to seek a simpler interpretation, is now fitted.
modelo_varimax<-fa(poly_cor,nfactors = 5,rotate = "varimax",
fm="minres") #fm specifies the factoring method; the MR labels below correspond to the minimum residual estimator.
#The rotated factorial matrix is shown.
print(modelo_varimax$loadings,cut=0)
##
## Loadings:
## MR1 MR2 MR5 MR4 MR3
## Year 0.017 -0.020 0.179 0.040 0.125
## Life.expectancy 0.369 -0.061 0.805 0.238 0.301
## Adult.Mortality -0.184 0.001 -0.708 -0.115 -0.114
## infant.deaths -0.172 0.977 -0.042 -0.029 -0.098
## Alcohol 0.519 0.020 0.079 0.263 0.147
## percentage.expenditure 0.205 -0.029 0.112 0.902 0.008
## Hepatitis.B 0.030 -0.132 0.077 -0.025 0.529
## Measles -0.085 0.475 -0.053 -0.026 -0.108
## BMI 0.514 -0.112 0.363 0.103 0.156
## under.five.deaths -0.171 0.971 -0.065 -0.029 -0.117
## Polio 0.172 -0.069 0.196 0.078 0.700
## Total.expenditure 0.324 -0.062 0.014 0.095 0.097
## Diphtheria 0.169 -0.065 0.180 0.060 0.828
## HIV.AIDS -0.043 0.004 -0.657 0.019 -0.020
## GDP 0.197 -0.045 0.180 0.873 0.059
## Population -0.074 0.541 0.053 0.002 -0.004
## thinness..1.19.years -0.845 0.333 -0.187 -0.027 0.003
## thinness.5.9.years -0.835 0.339 -0.190 -0.028 0.007
## Resource.Income 0.421 0.000 0.462 0.322 0.287
## Schooling 0.485 -0.025 0.430 0.321 0.316
##
## MR1 MR2 MR5 MR4 MR3
## SS loadings 2.846 2.692 2.355 1.957 1.851
## Proportion Var 0.142 0.135 0.118 0.098 0.093
## Cumulative Var 0.142 0.277 0.395 0.492 0.585
The following diagram illustrates which variables each factor is correlated with.
fa.diagram(modelo_varimax,space=15)
Another way to perform the previous analysis is shown below.
#This function only performs the mle method (maximum-likelihood estimator method).
FA<-factanal(expect_num,factors=5, rotation="varimax")
FA$loadings
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4 Factor5
## Year 0.154 0.116
## Life.expectancy 0.882 0.273 0.208 0.263
## Adult.Mortality -0.680 -0.169 -0.112
## infant.deaths 0.974 -0.188
## Alcohol 0.223 0.385 0.253 0.135
## percentage.expenditure 0.154 0.164 0.925
## Hepatitis.B -0.116 0.552
## Measles 0.490 -0.103
## BMI 0.416 -0.110 0.455 0.103 0.150
## under.five.deaths 0.972 -0.184 -0.113
## Polio 0.247 0.131 0.704
## Total.expenditure 0.252 0.100
## Diphtheria 0.226 0.137 0.832
## HIV.AIDS -0.599
## GDP 0.203 0.169 0.893
## Population 0.542
## thinness..1.19.years -0.201 0.290 -0.899
## thinness.5.9.years -0.196 0.296 -0.897
## Resource.Income 0.559 0.283 0.283 0.249
## Schooling 0.563 0.314 0.276 0.270
##
## Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings 2.759 2.653 2.499 1.973 1.811
## Proportion Var 0.138 0.133 0.125 0.099 0.091
## Cumulative Var 0.138 0.271 0.396 0.494 0.585
It is observed that the code provides the factor loadings, indicating the relationships between the original variables and the extracted factors. The closer to 1 or -1, the stronger the relationship, and the closer to 0, the weaker. Additionally, if the number is positive, it indicates a direct relationship, and if negative, an inverse one.
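To read a loadings table quickly, it can help to list, for each factor, the variables whose absolute loading exceeds some threshold. A sketch of the idiom on a built-in dataset (mtcars; the 0.5 cutoff is an arbitrary choice):

```r
#List, per factor, the variables with absolute loadings above 0.5.
fa_demo <- factanal(mtcars, factors = 2, rotation = "varimax")
L <- unclass(fa_demo$loadings) #loadings as a plain matrix
sapply(colnames(L), function(f) rownames(L)[abs(L[, f]) > 0.5],
       simplify = FALSE)
```

The same idiom applies to FA$loadings from the model above.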
The dataset contains information on various social indicators measured in different countries, including the country’s life expectancy indicator. Therefore, it would be interesting to provide a classification method for the lifespan based on the other indicators. To do this, a qualitative variable is defined, which will be the response variable for the classification, called ‘life.’ This variable has two categories:
#Binary response: "short" if life expectancy is below the mean, "long" otherwise.
life <- factor(ifelse(data$Life.expectancy < mean(data$Life.expectancy),
                      "short", "long"))
First, we explore how well (or poorly) each of the explanatory variables, considered independently, classifies life expectancy. To do this, we draw superimposed histograms: if the histograms of the two groups are well separated, the variable in question would be a good individual classifier for the life-expectancy category.
#Create an individual histogram plot for each explanatory variable.
vars <- setdiff(names(data), "Life.expectancy")
plots <- lapply(vars, function(v) {
  ggplot(data = data, aes(x = .data[[v]], fill = life)) +
    geom_histogram(position = "identity", alpha = 0.5) +
    labs(title = paste("Histogram of", v))
})
for (p in plots) print(p)
The variable Schooling is the one that best differentiates between long and short lifespans (its histograms show the least overlap).
Next, we graphically explore the normality of the univariate distributions of our predictors by plotting histograms and performing a Shapiro-Wilk test.
Univariate histograms
# Histogram representation of each variable.
par(mfcol = c(2, 2))
# Loop through all variables
for (variable in names(data)) {
  # Create a histogram
  hist_data <- data[[variable]]
  hist(hist_data, probability = TRUE, col = grey(0.8),
       main = paste("Histogram of", variable), xlab = variable)
  # Overlay a normal density curve
  curve(dnorm(x, mean = mean(hist_data), sd = sd(hist_data)),
        col = "blue", lwd = 2, add = TRUE)
}
# Reset the layout to default
par(mfcol = c(1, 1))
Qqplots
par(mfrow=c(2,2))
for (k in 1:19) {
  j0 <- names(data)[k]
  for (i in 1:2) {
    i0 <- levels(life)[i]
    x <- data[life == i0, j0]
    qqnorm(x, main = paste("life expectancy", i0, j0), pch = 19, col = i + 1)
    qqline(x)
  }
}
Based on the graphical techniques employed, it is evident that we will not achieve univariate normality in our variables.
Univariate normality test (Shapiro-Wilk)
The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05; otherwise, the assumption of normality is not rejected.
datos1<- data
datos1[,2] <- life #replace Life.expectancy (column 2) with the binary factor 'life'
datos_tidy <- melt(datos1, value.name = "value")
result <- aggregate(value ~ Life.expectancy + variable, data = datos_tidy,
FUN = function(x) shapiro.test(x)$p.value)
result
The hypothesis of normality is rejected, as the p-values given by the Shapiro-Wilk test are less than 0.05.
We encounter trouble because the MVN function can only be used with fewer than 2000 observations, and we have more. The energy test is another statistical test that determines whether a group of variables follows a multivariate normal distribution. The null and alternative hypotheses for the test are as follows:
H0 (null): The variables follow a multivariate normal distribution.
Ha (alternative): The variables do not follow a multivariate normal distribution.
#Perform the energy test of multivariate normality ('mvnorm.etest' from the energy package).
mvnorm.etest(data, R=100)
##
## Energy test of multivariate normality: estimated parameters
##
## data: x, sample size 2938, dimension 20, replicates 100
## E-statistic = 31.93, p-value < 2.2e-16
Given that our p-value < 0.05, we’ll reject the null hypothesis. Therefore, we have evidence to suggest that the variables in the dataset do not follow a multivariate normal distribution. We can also use the following code:
#install.packages("MVN")
#Package required to call the 'mvn' function.
library(MVN)
hz_test <- mvn(data = data, mvnTest = "hz")
hz_test$multivariateNormality
When there is a single predictor, the most recommended test is the Bartlett test. When multiple predictors are used, it must be verified that the covariance matrix is constant across all groups. In this case it is also advisable to check the homogeneity of variance of each predictor individually.
The most recommended test is Box's M test, an extension of the Bartlett test to multivariate scenarios. It must be taken into account that it is very sensitive to departures from multivariate normality. For this reason, it is recommended to require p-value < 0.001 to reject the null hypothesis.
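The individual-level check mentioned above can be sketched by applying Bartlett's test column by column. This is a minimal illustration on simulated data; `X` and `g` are stand-ins for the numeric predictors in `data` and the grouping factor `life`, not the real columns.

```r
# Sketch: per-predictor homogeneity of variance via Bartlett's test.
set.seed(1)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100, sd = 2))
g <- factor(rep(c("long", "short"), each = 50))
# One Bartlett test per predictor; small p-values flag non-constant variance.
pvals <- sapply(X, function(v) bartlett.test(v ~ g)$p.value)
pvals
```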
The null hypothesis to be tested is that of equality of covariance matrices in all groups.
#install.packages("biotools")
#Package required to call the 'boxM' function.
library(biotools)
datos <- data[, -2]
boxM(data = data[, 1:length(datos)], grouping = life)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: data[, 1:length(datos)]
## Chi-Sq (approx.) = 8882.8, df = 190, p-value < 2.2e-16
In this case we do reject the null hypothesis, since p-value < 0.001, and therefore we cannot assume homogeneity of the covariance matrices.
It is important to remember that this conclusion is only reliable if the assumption of multivariate normality is met, which was not the case here. In fact, when the data are not multivariate normal this test almost always comes out significant, and is therefore not reliable.
Now we fit a linear discriminant classification model, but we’ll have to keep in mind that the assumptions of multivariate normality and homogeneity of variance aren’t satisfied.
modelo_lda <- lda(formula = life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling ,data = data)
modelo_lda
## Call:
## lda(life ~ Year + Adult.Mortality + infant.deaths + Alcohol +
## percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths +
## Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP +
## Population + thinness..1.19.years + thinness.5.9.years +
## Resource.Income + Schooling, data = data)
##
## Prior probabilities of groups:
## long short
## 0.5707965 0.4292035
##
## Group means:
## Year Adult.Mortality infant.deaths Alcohol percentage.expenditure
## long 2007.776 101.9585 1.920391 5.916319 106.46302
## short 2007.176 212.5657 4.757941 2.825644 52.68684
## Hepatitis.B Measles BMI under.five.deaths Polio
## long 90.38385 25.91884 47.21801 3.007014 94.31923
## short 84.37839 48.13386 26.48947 7.468710 90.49733
## Total.expenditure Diphtheria HIV.AIDS GDP Population
## long 6.080152 94.21127 0.1126262 4004.353 6040694
## short 5.202103 90.26405 0.1856046 2259.682 5980829
## thinness..1.19.years thinness.5.9.years Resource.Income Schooling
## long 2.841068 2.784720 0.7552718 13.74616
## short 5.858415 5.935501 0.5253868 10.38843
##
## Coefficients of linear discriminants:
## LD1
## Year 1.132738e-02
## Adult.Mortality 4.340997e-03
## infant.deaths 1.262522e-02
## Alcohol -1.970419e-02
## percentage.expenditure -9.283862e-04
## Hepatitis.B -1.582141e-02
## Measles 2.458324e-03
## BMI -4.536101e-03
## under.five.deaths -1.064769e-03
## Polio -3.381613e-03
## Total.expenditure -3.877981e-02
## Diphtheria -3.881977e-03
## HIV.AIDS 5.853773e+00
## GDP -1.935186e-05
## Population -1.312976e-08
## thinness..1.19.years -8.493223e-02
## thinness.5.9.years 1.001592e-01
## Resource.Income -4.656544e+00
## Schooling -6.379008e-02
The output of this object shows us the prior probabilities of each group (0.5707965 and 0.4292035 for long and short, respectively), the means of each regressor per group and the coefficients of the linear discriminant function, which in this case takes the form:
LD1 = 1.132738e-02*Year + 4.340997e-03*Adult.Mortality + 1.262522e-02*infant.deaths - 1.970419e-02*Alcohol - 9.283862e-04*percentage.expenditure - 1.582141e-02*Hepatitis.B + 2.458324e-03*Measles - 4.536101e-03*BMI - 1.064769e-03*under.five.deaths - 3.381613e-03*Polio - 3.877981e-02*Total.expenditure - 3.881977e-03*Diphtheria + 5.853773e+00*HIV.AIDS - 1.935186e-05*GDP - 1.312976e-08*Population - 8.493223e-02*thinness..1.19.years + 1.001592e-01*thinness.5.9.years - 4.656544e+00*Resource.Income - 6.379008e-02*Schooling
Once the classifier is built, we can classify new data based on its measurements by simply calling the predict function.
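As a minimal sketch of that call on simulated stand-in data (the column names `x1`, `x2` and the point `nuevo` are illustrative, not the real predictors), a new observation is classified by passing it to `predict` via `newdata`:

```r
library(MASS)
set.seed(2)
# Toy two-group data standing in for the real predictors.
df <- data.frame(
  grp = factor(rep(c("long", "short"), each = 50)),
  x1 = c(rnorm(50, 0), rnorm(50, 2)),
  x2 = c(rnorm(50, 0), rnorm(50, 2))
)
fit <- lda(grp ~ x1 + x2, data = df)
# Classify a hypothetical new observation:
nuevo <- data.frame(x1 = 0.1, x2 = -0.2)
predict(fit, newdata = nuevo)$class      # predicted group
predict(fit, newdata = nuevo)$posterior  # posterior probability of each group
```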
The confusionmatrix function from the biotools package cross-tabulates observed and predicted classes, summarizing the performance of the classification model.
pred <- predict(modelo_lda, dimen = 1)
confusionmatrix(life, pred$class)
## new long new short
## long 1570 107
## short 213 1048
#Percentage of classification errors
training_error <- mean(life != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 10.8917631041525 %"
In this case the correct classification rate is approximately 89.11%.
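Note that this is a resubstitution (training) rate; a less optimistic estimate can be obtained with leave-one-out cross-validation, which lda supports through its CV argument. A sketch on the built-in iris data (the real model would reuse the formula of modelo_lda):

```r
library(MASS)
# With CV = TRUE, lda returns the class assigned to each observation
# when that observation is left out of the fit (leave-one-out CV).
cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)
cv_error <- mean(iris$Species != cv_fit$class) * 100
cv_error  # cross-validated error (%), usually a bit above the training error
```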
From a geometric point of view, linear discriminant analysis separates space using a straight line. In this sense, the partimat function of the klaR package allows us to represent the classification limits of a linear or quadratic discriminant model for each pair of predictors. Each color represents a classification region according to the model, the centroid of each region and the real value of the observations are shown.
partimat(life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling,
data = data, method = "lda", prec = 200,
image.colors = c("green", "orange"),
col.mean = "yellow",nplots.vert = 1, nplots.hor=3)
Unlike the classification using all the explanatory variables, which has an error of 10.89%, the classifications based on each pair of variables make larger errors, ranging from 0.106 to 0.429, with most error rates lying between 0.17 and 0.3.
In view of the graphs, it is observed that the pair of predictors that best classifies the data is Adult.Mortality and Resource.Income. If a classification model with two predictors were to be built, this pair would be considered.
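Such a reduced model would be fitted with the same call, restricted to that pair. The sketch below uses simulated stand-ins for Adult.Mortality and Resource.Income (roughly matching the group means reported above), since only the form of the call is the point here:

```r
library(MASS)
set.seed(3)
# Simulated stand-ins for the two best pairwise predictors.
sim <- data.frame(
  life = factor(rep(c("long", "short"), each = 100)),
  Adult.Mortality = c(rnorm(100, 102, 30), rnorm(100, 213, 60)),
  Resource.Income = c(rnorm(100, 0.76, 0.10), rnorm(100, 0.53, 0.15))
)
modelo_2pred <- lda(life ~ Adult.Mortality + Resource.Income, data = sim)
# Training error of the reduced two-predictor model:
mean(sim$life != predict(modelo_2pred)$class) * 100
```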
Just as for Linear Discriminant Analysis, Quadratic Discriminant Analysis starts with the graphical exploration of the data and the checks of univariate and multivariate normality and of homogeneity of variances, which have already been conducted above.
Although the assumption of multivariate normality is not met, and given that the variances are not homogeneous, a quadratic discriminant model is fitted, since QDA estimates a separate covariance matrix for each group and therefore does not require homogeneity of variances. It should be kept in mind, however, that the lack of multivariate normality may lead to unexpected results.
modelo_qda<- qda(formula = life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling,data = data)
modelo_qda
## Call:
## qda(life ~ Year + Adult.Mortality + infant.deaths + Alcohol +
## percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths +
## Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP +
## Population + thinness..1.19.years + thinness.5.9.years +
## Resource.Income + Schooling, data = data)
##
## Prior probabilities of groups:
## long short
## 0.5707965 0.4292035
##
## Group means:
## Year Adult.Mortality infant.deaths Alcohol percentage.expenditure
## long 2007.776 101.9585 1.920391 5.916319 106.46302
## short 2007.176 212.5657 4.757941 2.825644 52.68684
## Hepatitis.B Measles BMI under.five.deaths Polio
## long 90.38385 25.91884 47.21801 3.007014 94.31923
## short 84.37839 48.13386 26.48947 7.468710 90.49733
## Total.expenditure Diphtheria HIV.AIDS GDP Population
## long 6.080152 94.21127 0.1126262 4004.353 6040694
## short 5.202103 90.26405 0.1856046 2259.682 5980829
## thinness..1.19.years thinness.5.9.years Resource.Income Schooling
## long 2.841068 2.784720 0.7552718 13.74616
## short 5.858415 5.935501 0.5253868 10.38843
The output of this object shows us the prior probabilities of each group, in this case 0.5707965 and 0.4292035 for long and short, respectively, and the means of each regressor per group.
Once the classifier is built, we can classify new data based on its measurements by simply calling the predict function.
The confusionmatrix function from the biotools package cross-tabulates observed and predicted classes, summarizing the performance of the classification model.
pred <- predict(modelo_qda)
confusionmatrix(life, pred$class)
## new long new short
## long 1578 99
## short 208 1053
training_error <- mean(life != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 10.4492852280463 %"
In this case the correct classification rate is approximately 89.56%, slightly higher than in the LDA case.
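This small edge of QDA over LDA is consistent with the Box M result: when the group covariance matrices differ, QDA's per-group covariances can bend the decision boundary. A toy illustration on simulated data (not the real variables), where one group is much more dispersed than the other:

```r
library(MASS)
set.seed(4)
# Two groups with similar means but very different spreads.
n <- 200
toy <- data.frame(
  grp = factor(rep(c("long", "short"), each = n)),
  x1 = c(rnorm(n, 0, 1), rnorm(n, 1, 3)),
  x2 = c(rnorm(n, 0, 1), rnorm(n, 1, 3))
)
train_err <- function(fit) mean(toy$grp != predict(fit)$class) * 100
# Compare training errors of the linear and quadratic models:
c(lda = train_err(lda(grp ~ x1 + x2, data = toy)),
  qda = train_err(qda(grp ~ x1 + x2, data = toy)))
```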
The partimat function of the klaR package allows us to represent the classification limits of a linear or quadratic discriminant model for each pair of predictors. Each color represents a classification region according to the model, the centroid of each region and the real value of the observations are shown.
partimat(life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling,
data = expect_num, method = "qda", prec = 200,
image.colors = c("darkgoldenrod1", "snow2"),
col.mean = "firebrick",nplots.vert =1, nplots.hor=3)